SoMaJo: State-of-the-art tokenization for German web and social media texts

نویسندگان

  • Thomas Proisl
  • Peter Uhrig
چکیده

In this paper we describe SoMaJo, a rulebased tokenizer for German web and social media texts that was the best-performing system in the EmpiriST 2015 shared task with an average F1-score of 99.57. We give an overview of the system and the phenomena its rules cover, as well as a detailed error analysis. The tokenizer is available as

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A POS Tagger for Social Media Texts Trained on Web Comments

Using social media tools such as blogs and forums have become more and more popular in recent years. Hence, a huge collection of social media texts from different communities is available for accessing user opinions, e.g., for marketing studies or acceptance research. Typically, methods from Natural Language Processing are applied to social media texts to automatically recognize user opinions. ...

متن کامل

EmpiriST: AIPHES - Robust Tokenization and POS-Tagging for Different Genres

We present our system used for the AIPHES team submission in the context of the EmpiriST shared task on “Automatic Linguistic Annotation of ComputerMediated Communication / Social Media”. Our system is based on a rulebased tokenizer and a machine learning sequence labelling POS tagger using a variety of features. We show that the system is robust across the two tested genres: German computer me...

متن کامل

LTL-UDE $@$ EmpiriST 2015: Tokenization and PoS Tagging of Social Media Text

We present a detailed description of our submission to the EmpiriST shared task 2015 for tokenization and part-of-speech tagging of German social media text. As relatively little training data is provided, neither tokenization nor PoS tagging can be learned from the data alone. For tokenization, our system uses regular expressions for general cases and word lists for exceptions. For PoS tagging...

متن کامل

The Role of the German Researchers in the Formation of Islamic Art Studies

In the beginning of the nineteenth century, with the increasing interest of the Europeans in the culture of the East, the first articles on the Islamic art and culture were appeared in German-speaking countries. In the mid nineteenth century, some entries in German encyclopedias were devoted to Islamic art, and from the end of the century, the first monographs on Islamic architecture and orname...

متن کامل

The TextPro Tool Suite

We present TextPro, a suite of modular Natural Language Processing (NLP) tools for analysis of Italian and English texts. The suite has been designed so as to integrate and reuse state of the art NLP components developed by researchers at FBK. The current version of the tool suite provides functions ranging from tokenization to chunking and Named Entity Recognition (NER). The system‟s architect...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2016